---
title: Extracting Text from PDF Pages
description: Learn how to extract text content from PDF pages using PDFium
searchable: true
---

# Extracting Text from PDF Pages

## Description

Text extraction is one of the most common operations when working with PDFs. PDFium provides powerful text extraction capabilities through its WebAssembly API. This guide explains how to extract text from PDF pages, which is useful for searching, indexing, and making PDF content accessible.

## Interactive Example

This interactive example demonstrates how to extract text from PDF pages. You can navigate between pages to see the extracted text from each page:

import ExtractTextFromPdfDemo from '../code-examples/extract-text-from-pdf.preview';

<ExtractTextFromPdfDemo />

## The Basic Workflow

Extracting text from a PDF page involves several steps:

1. Initialize PDFium
2. Load the PDF document
3. Load the specific page
4. Create a text page object for text extraction
5. Get the character count
6. Allocate a buffer for the text
7. Extract the text into the buffer
8. Convert the text from UTF-16LE to a JavaScript string
9. Clean up resources

## Prerequisites

The examples in this guide use the `initializePdfium` helper function from our [Getting Started](/docs/pdfium/getting-started#initializing-pdfium) guide. This function properly initializes the PDFium library, including calling the required `PDFiumExt_Init()` function. Make sure to include this initialization code in your application before trying the examples below.

## Example Implementation

Here's a complete implementation for loading a PDF document and extracting text from its pages:

```typescript
import { initializePdfium } from './initialize-pdfium';
// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started

/**
 * Loads a PDF document and returns an object with methods to access and extract text from it.
 * This example focuses solely on extracting text from PDF pages.
 */
export async function loadPdfForTextExtraction(pdfData: Uint8Array) {
  // Initialize PDFium
  const pdfium = await initializePdfium();
  
  // Allocate memory for the PDF data
  const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
  pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
  
  // Load the document
  const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
  
  if (!docPtr) {
    const error = pdfium.FPDF_GetLastError();
    pdfium.pdfium.wasmExports.free(filePtr);
    
    // Handle password-protected documents
    if (error === 4) {
      return {
        hasPassword: true,
        pageCount: 0,
        close: () => {}, // No-op since no document was loaded
        extractText: async () => { throw new Error('Document is password protected'); }
      };
    }
    
    throw new Error(`Failed to load PDF: ${error}`);
  }
  
  // Get page count
  const pageCount = pdfium.FPDF_GetPageCount(docPtr);
  
  // Return an object with document info and text extraction capabilities
  return {
    hasPassword: false,
    pageCount,
    
    // Close the document and free resources
    close: () => {
      pdfium.FPDF_CloseDocument(docPtr);
      pdfium.pdfium.wasmExports.free(filePtr);
    },
    
    // Extract all text from a specific page
    extractText: async (pageIndex: number): Promise<string> => {
      // Check if the page index is valid
      if (pageIndex < 0 || pageIndex >= pageCount) {
        throw new Error(`Invalid page index: ${pageIndex}. Document has ${pageCount} pages.`);
      }
      
      // Load the page for rendering
      const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
      if (!pagePtr) {
        throw new Error(`Failed to load page ${pageIndex}`);
      }
      
      try {
        // Load the page text (a separate structure for text extraction)
        const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
        if (!textPagePtr) {
          throw new Error(`Failed to load text for page ${pageIndex}`);
        }
        
        try {
          // Get the character count on the page
          const charCount = pdfium.FPDFText_CountChars(textPagePtr);
          if (charCount <= 0) {
            return ''; // No text on this page or error
          }
          
          // Allocate a buffer for the text (+1 for null terminator)
          const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
          const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
          
          try {
            // Extract the text (PDFium uses UTF-16LE encoding)
            const extractedLength = pdfium.FPDFText_GetText(
              textPagePtr, 
              0, // Start index
              charCount, // Character count to extract
              textBufferPtr // Output buffer
            );
            
            // If text was extracted, convert from UTF-16LE to JavaScript string
            if (extractedLength > 0) {
              // Convert UTF-16 array to JavaScript string
              return pdfium.pdfium.UTF16ToString(textBufferPtr);
            }
            
            return ''; // No text extracted
          } finally {
            // Clean up text buffer
            pdfium.pdfium.wasmExports.free(textBufferPtr);
          }
        } finally {
          // Clean up text page
          pdfium.FPDFText_ClosePage(textPagePtr);
        }
      } finally {
        // Clean up page
        pdfium.FPDF_ClosePage(pagePtr);
      }
    }
  };
}
```

## Usage Example

Here's a simple example of how to use the text extraction functionality:

```javascript
// Load a PDF file and extract text from its pages
async function extractTextFromPdf() {
  try {
    // Fetch the PDF file
    const response = await fetch('sample.pdf');
    const arrayBuffer = await response.arrayBuffer();
    const pdfData = new Uint8Array(arrayBuffer);
    
    // Load the document
    const pdfDocument = await loadPdfForTextExtraction(pdfData);
    
    // Get the page count
    const pageCount = pdfDocument.pageCount;
    console.log(`PDF has ${pageCount} pages`);
    
    // Extract text from each page
    for (let i = 0; i < pageCount; i++) {
      const text = await pdfDocument.extractText(i);
      console.log(`--- Page ${i + 1} ---`);
      console.log(text || '[No text on this page]');
    }
    
    // Clean up when done
    pdfDocument.close();
  } catch (error) {
    console.error('Error extracting text:', error);
  }
}

extractTextFromPdf();
```

## Memory Management Considerations

When extracting text from PDF pages, proper memory management is critical to prevent memory leaks:

1. **Text page cleanup**: Always close text pages using `FPDFText_ClosePage` when you're done with them.

2. **Page cleanup**: Always close pages using `FPDF_ClosePage` when you're done with them.

3. **Buffer cleanup**: Free any allocated memory buffers using `pdfium.pdfium.wasmExports.free()` when you're done with them.

4. **Nested try/finally blocks**: Use nested try/finally blocks to ensure proper cleanup of resources in the correct order, even if errors occur.

## Text Extraction Limitations

When working with text extraction in PDFs, be aware of these limitations:

1. **Text recognition**: PDFium can only extract text that is actually stored as text in the PDF. It cannot perform OCR (Optical Character Recognition) on scanned documents or images.

2. **Text order**: The order of extracted text might not always match the visual order on the page, especially for complex layouts or documents with multiple columns.

3. **Special characters**: Some special characters or symbols might not be correctly extracted or might be represented differently in the extracted text.

4. **Password protection**: Text cannot be extracted from password-protected PDFs without providing the correct password.

## Related Functions

- [FPDF_LoadMemDocument](/docs/pdfium/functions/FPDF_LoadMemDocument) - Load a PDF document from memory
- [FPDF_GetPageCount](/docs/pdfium/functions/FPDF_GetPageCount) - Get the number of pages in a document
- [FPDF_LoadPage](/docs/pdfium/functions/FPDF_LoadPage) - Load a specific page from a document
- [FPDFText_LoadPage](/docs/pdfium/functions/FPDFText_LoadPage) - Load a page for text extraction
- [FPDFText_CountChars](/docs/pdfium/functions/FPDFText_CountChars) - Get the number of characters on a page
- [FPDFText_GetText](/docs/pdfium/functions/FPDFText_GetText) - Extract text from a page
- [FPDFText_ClosePage](/docs/pdfium/functions/FPDFText_ClosePage) - Close a text page and release resources
- [FPDF_ClosePage](/docs/pdfium/functions/FPDF_ClosePage) - Close a page and release resources
- [FPDF_CloseDocument](/docs/pdfium/functions/FPDF_CloseDocument) - Close a document and release resources
- [FPDF_GetLastError](/docs/pdfium/functions/FPDF_GetLastError) - Get the error code for the last failed operation 